Container Monitoring: Beyond cAdvisor

Screen showing performance graphs for multiple services and metrics

cAdvisor was the tool that democratised container monitoring, and it remains relevant: the kubelet still embeds it. But in 2024, observing containers well requires more layers: cluster-state metrics, eBPF for deep visibility, APM for application context. This article maps out what to combine and how.

The Modern Minimum Stack

For serious Kubernetes in 2024:

  • kubelet / cAdvisor: CPU, memory, network, disk metrics per container.
  • kube-state-metrics: state of Deployments, Pods, ReplicaSets, HPA.
  • node-exporter: node metrics.
  • Prometheus: scrapes and aggregates all of the above.
  • Grafana: visualisation.

This covers 80% of what you need to monitor, and it is a solid OSS stack.
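As a sketch, the scrape layer of that base stack might look like the following raw Prometheus config. Most installs get this generated by kube-prometheus-stack; the paths and service names here are illustrative, not canonical:

```yaml
scrape_configs:
  # Per-container metrics from cAdvisor, embedded in each kubelet
  - job_name: cadvisor
    kubernetes_sd_configs:
      - role: node
    scheme: https
    metrics_path: /metrics/cadvisor
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  # Cluster object state (Deployments, Pods, HPA...)
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
```

node-exporter would be a third job, typically discovered via endpoints; Grafana then points at Prometheus as a data source.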

What’s Missing Without eBPF

cAdvisor gives “surface” metrics:

  • CPU usage %.
  • Memory RSS.
  • Network bytes.
  • Disk I/O.

But not:

  • Syscall latency: is the pod stuck on I/O?
  • Network latency between specific pods.
  • CPU profiles: which functions are consuming cycles.
  • Function-level detail: hot paths.

For this, eBPF is the modern tool.

eBPF: The Game-Changing Layer

Modern eBPF tools:

Pixie

Pixie (CNCF sandbox, originally from New Relic):

  • Auto-instruments HTTP/gRPC/DNS without sidecars or code changes.
  • Live flame graphs.
  • Automatic service map.
  • PxL-language queries.

One per-node eBPF agent + web UI. Developer-friendly.
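For a flavour of PxL, this is roughly the hello-world from Pixie's docs (the http_events table is standard, though column details vary by version):

```pxl
# Show the last five minutes of HTTP traffic Pixie captured, no instrumentation needed
df = px.DataFrame(table='http_events', start_time='-5m')
px.display(df)
```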

Grafana Beyla

Beyla:

  • Auto-instrumentation for Go, Java, Node apps.
  • Generates OpenTelemetry traces without code modification.
  • Grafana stack integration.

Simpler than Pixie, focused on traces/metrics.
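A minimal sketch of how Beyla attaches, as a container spec fragment. The env var names below match Beyla's docs as I recall them, so verify against the current release:

```yaml
- name: beyla
  image: grafana/beyla:latest
  securityContext:
    privileged: true              # eBPF probes need elevated privileges
  env:
    - name: BEYLA_OPEN_PORT       # instrument whatever process listens on this port
      value: "8080"
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://otel-collector:4318"
```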

Parca

Parca:

  • Continuous profiling of the whole cluster.
  • eBPF flame graphs.
  • Grafana-integrable.

Specific for CPU profiling.

Inspektor Gadget

A Kinvolk/Microsoft tool for eBPF-based debugging:

  • kubectl trace equivalents.
  • Per-pod network snapshots.
  • On-demand profiling.

APM: The Application Layer

eBPF gives infra visibility; APM gives application visibility:

  • OpenTelemetry: the open standard, increasingly adopted.
  • Jaeger / Grafana Tempo: trace backends.
  • Datadog / New Relic / Dynatrace: complete commercial suites.

With the OTel SDK, your app emits:

  • Request spans.
  • Business metrics.
  • Correlated logs.

Beyla can auto-generate some of this, but for business metrics you need the SDK.
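To make spans-plus-business-attributes concrete without pulling in a backend, here is a toy, stdlib-only sketch of the pattern the OTel SDK implements for you. The names are hypothetical, not the real OTel API:

```python
import time
import uuid
from contextlib import contextmanager

# Toy span recorder: a stand-in for what the OTel SDK automates; not the real API.
SPANS = []

@contextmanager
def span(name, **attributes):
    """Record a timed span with a trace id and arbitrary business attributes."""
    record = {"name": name, "trace_id": uuid.uuid4().hex, "attrs": attributes}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

# A request handler would wrap its work like this:
with span("process_order", order_value_eur=99.50):
    time.sleep(0.01)  # simulate work

print(SPANS[0]["name"], round(SPANS[0]["duration_ms"]))
```

The real SDK adds context propagation, exporters, and sampling; the point is that each span carries both timing and the business attributes that infra-level tools cannot see.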

Combining Without Saturating

A common mistake: running every tool at once, which means massive overhead. The typical sweet spot:

  • cAdvisor + kube-state-metrics + node-exporter: light base.
  • eBPF (Pixie or Beyla): add when needing deep visibility.
  • APM with OTel: for critical apps, not all.
  • Commercial APM: only with clear use case vs OSS.

Each layer should add distinct value. Duplicating is waste.

Essential Per-Container Metrics

Always monitor:

  • CPU throttling: is the pod rate-limited?
  • Memory working set: real use, not RSS.
  • OOM kills: key counter.
  • Network errors: TX/RX drops.
  • Disk pressure: fullness + I/O saturation.
  • Restart count: flapping = problem.

For Kubernetes, additionally:

  • Pod phase: Pending, Running, Failed.
  • Readiness probe failures.
  • HPA desired vs current.
  • PVC usage.
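Hedged PromQL sketches for a few of these (the metric names are the standard cAdvisor and kube-state-metrics ones; label sets vary per install):

```promql
# CPU throttling ratio per container
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m])

# Working set, not RSS: what the OOM killer and the HPA actually look at
container_memory_working_set_bytes

# Restarts over the last hour (flapping detector)
increase(kube_pod_container_status_restarts_total[1h])
```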

Alerts Worth Having

Few but effective:

  • Pod restart > N in Y minutes: flapping.
  • Sustained CPU throttling > 50%: the CPU limit is too low.
  • OOM kills: always investigate.
  • Memory > 90% limit sustained: leak or sizing.
  • Node not ready > X minutes: incident.
  • HPA at max replicas for > Y min: capacity issue.

Fewer useful alerts > many ignored alerts.
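Two of these written as Prometheus alerting rules, as a sketch; the thresholds and labels are examples to adapt:

```yaml
groups:
  - name: container-alerts
    rules:
      - alert: PodRestartFlapping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting repeatedly"

      - alert: SustainedCPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.container }} throttled over 50% for 15 minutes"
```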

Dashboards: What to Show

Typical levels:

  • Cluster overview: total resources, nodes, pods.
  • Namespace: per team/application.
  • Workload: specific deploy, pods, containers.
  • Pod detail: drill-down for troubleshooting.

The community Kubernetes dashboards on grafana.com (IDs 315, 6417, etc.) are good starting points.

Logging Integration

Metrics without logs are only half the picture. A typical stack:

  • Fluent Bit or Loki-native shippers for log collection.
  • Loki for storage + Grafana for visualization.
  • Trace correlation via trace IDs.

When investigating an incident, you need metrics + logs + traces on the same timeline.

Security Observability

Complementary:

  • Falco: eBPF runtime security.
  • Tracee (Aqua): similar, eBPF-based.
  • Kubernetes API audit logs.

Not standard “monitoring” but part of the complete picture.

What You Can Skip

  • Telegraf: still valid, but the Prometheus ecosystem is the default now.
  • Standalone InfluxDB: Prometheus (metrics) plus Loki (logs) cover the same ground.
  • Legacy Stackdriver: GCP-only, lock-in.
  • ELK for metrics: better to keep Elastic for logs alone.

A Practical Example

Typical 50-node, 500-pod cluster:

  • Prometheus federation: ~2000 targets, 5M series.
  • Retention: 30 days hot + object storage.
  • Grafana with 10-15 curated dashboards.
  • 20-30 useful alerts.
  • Beyla in some namespaces for traces.
  • Loki for logs.

An all-OSS stack, with roughly 5% of total cluster CPU/RAM as monitoring overhead.
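Those retention numbers can be sanity-checked with back-of-envelope arithmetic. The 30 s scrape interval and ~1.5 bytes per compressed sample are assumptions for the sketch, not figures from the cluster above:

```python
# Back-of-envelope Prometheus storage sizing for the example cluster.
ACTIVE_SERIES = 5_000_000     # active series, from the example above
SCRAPE_INTERVAL_S = 30        # assumed scrape interval
BYTES_PER_SAMPLE = 1.5        # assumed post-compression cost per sample
RETENTION_DAYS = 30           # hot retention, from the example above

samples_per_day = ACTIVE_SERIES * (86_400 // SCRAPE_INTERVAL_S)
daily_gb = samples_per_day * BYTES_PER_SAMPLE / 1e9
hot_retention_gb = daily_gb * RETENTION_DAYS

print(f"~{daily_gb:.1f} GB/day, ~{hot_retention_gb:.0f} GB for {RETENTION_DAYS} days hot")
```

Under these assumptions that is roughly 22 GB/day and ~650 GB of hot storage, which is why older data gets pushed to object storage.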

Conclusion

Monitoring containers well in 2024 requires more than cAdvisor. The OSS base (Prometheus + kube-state-metrics + node-exporter + Grafana) is solid and sufficient for most teams. eBPF (Pixie, Beyla, Parca) adds deep visibility when needed. APM with OpenTelemetry complements it with the application view. The trap is over-engineering: more tools mean more maintenance and more noise. Start with the solid base, add layers when a use case justifies them, and maintain alert discipline: fewer, well-thought-out alerts win.

Follow us on jacar.es for more on observability, Kubernetes, and eBPF.
